Abstract: We study the convergence of Markov decision processes, composed of a large number 
of objects, to optimization problems on ordinary differential equations. We show that the optimal 
reward of such a Markov decision process, which satisfies a Bellman equation, converges to 
the solution of a continuous Hamilton-Jacobi-Bellman (HJB) equation based on the mean field 
approximation of the Markov decision process. We give bounds on the difference of the rewards 
and an algorithm for deriving an approximating solution to the Markov decision process from a 
solution of the HJB equations. We illustrate the method on three examples pertaining, respectively, 
to investment strategies, population dynamics control and scheduling in queues. They are used to 
illustrate and justify the construction of the controlled ODE and to show the advantage of solving 
a continuous HJB equation rather than a large discrete Bellman equation. 
Mean Field Models and Markov Decision Processes: From Discrete Optimization to Continuous 
Optimization. 

Summary: This document studies the convergence of Markov decision processes composed of a 
large number of objects to optimization problems on differential equations. We show that the 
optimal reward of the decision process converges to the solution of a continuous equation of 
"Hamilton-Jacobi-Bellman" type. The proof uses both classical tools from mean field models 
and several new couplings between the discrete and continuous models, which yield explicit 
bounds. The method is then illustrated by three examples concerning investment strategies, the 
control of population dynamics and a resource allocation problem. 

Keywords: Mean Field, Hamilton-Jacobi-Bellman, Optimal Control, Markov Decision Process 
1 Introduction 

In this paper we study dynamic optimization problems on Markov decision processes composed of 
a large number of interacting objects. 

Consider a system of N objects evolving in a common environment. At each time step, objects 
change their state randomly according to some probability kernel . This kernel depends on the 
number of objects in each state, as well as on the decisions of a centralized controller. Our goal is 
to study the behavior of the controlled system when N becomes large. 

Several papers investigate the asymptotic behavior of such systems, but without controllers. 
For example, in OIH], the authors show that under mild conditions, as N grows, the system 
converges to a deterministic limit. The limiting system can be of two types, depending on the 
intensity I(N) (the intensity is the probability that an object changes its state between two time 
steps). If $I(N) = O_{N\to\infty}(1)$, the system converges to a dynamical system in discrete time [TS]. If 
I(N) goes to 0 as N grows, the limiting system is a continuous time dynamical system and can be 
described by ordinary differential equations (ODEs). 



Contributions 

Here, we consider a Markov decision process where at each time step, a central controller chooses 
an action from a predefined set that will modify the dynamics of the system. The controller receives 
a reward depending on the current state of the system and on the action. The goal of the controller 
is to maximize the expected reward over a finite time horizon. We show that when N becomes 
large this problem converges to an optimization problem on an ordinary differential equation. 

More precisely, we focus on the case where the Markov decision process is such that its empirical 
occupancy measure is also Markov; this occurs when the system consists of many interacting 
objects, the objects can be observed only through their state and the system evolution depends 
only on the collection of all states. We show that the optimal reward converges to the optimal 
reward of the mean field approximation of the system, which is given by the solution of an HJB 
equation. Furthermore, the optimal policy of the mean field approximation is also asymptotically 
optimal in N, for the original discrete system. Our method relies on bounding techniques used 
in stochastic approximation and learning [H [T] . We also introduce an original coupling method, 
where, to each sample path of the Markov decision process, we associate a random trajectory that 
is obtained as a solution of the ODE, i.e. the mean field limit, controlled by random actions. 

This convergence result has an algorithmic by-product. Roughly speaking, when confronted 
with a large Markov decision problem, we can first solve the HJB equation for the associated mean 
field limit and then build a decision policy for the initial system that is asymptotically optimal in 
N. 

Our results have two main implications. The first is to justify the construction of controlled 
ODEs as good approximations of large discrete controlled systems. This construction is often 
done without rigorous proofs. In Section 4.3.2 we illustrate this point with an example of malware 
infection in computer systems. 

The second implication concerns the effective computation of an optimal control policy. In the 
discrete case, this is usually done by using dynamic programming for the finite horizon case or by 
computing a fixed point of the Bellman equation in the discounted case. Both approaches suffer 
from the curse of dimensionality, which makes them impractical when the state space is too large. 
In our context, the size of the state space is exponential in N, making the problem even more 
acute. In practice, modern supercomputers only allow us to tackle such optimal control problems 
when N is no larger than a few tens [5D]. 

The mean field approach offers an alternative to brute force computations. By letting N go 
to infinity, the discrete problem is replaced by a limit Hamilton-Jacobi-Bellman equation that is 
deterministic and in which the dimensionality of the original system has been hidden in the 
occupancy measure. Solving the HJB equation numerically is sometimes rather easy, as in the 
examples in Sections 4.3.1 and 4.3.2. It provides a deterministic optimal policy whose reward 
with a finite (but large) number of objects is remarkably close to the optimal reward. 
Related Work 

Several papers in the literature are concerned with the problem of mixing the limiting behavior of 
a large number of objects with optimization. 

In [6], the value function of the Markov decision process is approximated by a linearly 
parametrized class of functions and a fluid approximation of the MDP is used. It is shown 
that a solution of the HJB equation is a value function for a modification of the original MDP 
problem. In |25l [S], the curse of dimensionality of dynamic programming is circumvented by 
approximating the value function by linear regression. Here, we use instead a mean field limit 
approximation and prove asymptotic optimality in N of the limit policy. 

In [9], the authors also consider Markov decision processes with a growing number of objects, 
but when the intensity is O(1). In their case, the optimization problem of the system of size N 
converges to a deterministic optimization problem in discrete time. In this paper, however, we focus 
on the o(1) case, which is substantially different from the discrete time case because the limiting 
system does not evolve in discrete time anymore. 

Actually, most of the papers dealing with mean field limits of optimization problems over 
large systems are set in a game theory framework, leading to the concept of mean field games 
introduced in [18]. The objects composing the system are seen as N players of a game with 
distributed information, cost and control; their actions lead to a Nash equilibrium. To the best 
of our knowledge, the classic case with global information and centralized control has not yet 
been considered. Our work focuses precisely on classic Markov decision problems, where a central 
controller (our objects are passive) aims at minimizing a global cost function. 

For example, a series of papers by M. Huang, P.E. Caines and P. Malhame such as pn[T^[T51[H] 
investigates the behavior of systems made of a large number of objects under distributed control. 
They mostly investigate Linear-Quadratic-Gaussian (LQG) dynamics and use the fact that, here, 
the solution can be given in closed form as a Riccati equation to show that the limit satisfies a 
Nash fixed point equation. Their more general approach uses the Nash Equivalence Certainty 
principle introduced in [IP. The limit equilibrium may or may not be a global optimum. Here, 
we consider the general case where the dynamics and the cost may be arbitrary (we do not assume 
LQG Dynamics) so that the optimal policy is not given in closed form. The main difference with 
their approach comes from the fact that we focus instead on centralized control to achieve a global 
optimum. The techniques to prove convergence are rather different. Our proofs are more in line 
with classic mean field arguments and use stochastic approximation techniques. 

Another example is the work of Tembine and others [531 121] , on the limits of games with 
many players. The authors provide conditions under which the limit when the number of players 
grows to infinity commutes with the fixed point equation satisfied by a Nash equilibrium. Again, 
our investigation solves a different problem and focuses on the centralized case. In addition, our 
approach is more algorithmic; we construct two intermediate systems: one with a finite number of 
objects controlled by a limit policy and one with a limit system controlled by a stochastic policy 
induced by the finite system. 

Structure of the paper 

The rest of the paper is structured as follows. In Section 2 we give definitions, some notation and 
hypotheses. In Section 3 we describe our main theoretical results. In Section 4 we describe our 
resulting algorithm and illustrate the application of our method with a few examples. The details 
of all proofs are in Section 5, and Section 6 concludes the paper. 

2 Notations and Definitions 
2.1 System with N Objects 

We consider a system composed of N objects. Each object has a state from the finite set 
$\mathcal{S} = \{1, \dots, S\}$. Time is discrete and the state of object n at step $k \in \mathbb{N}$ is denoted 
$X_n^N(k)$. The state of the system at time k is $X^N(k) = (X_1^N(k), \dots, X_N^N(k))$. We denote by 
$M^N(k)$ the empirical measure of the objects $(X_1^N(k), \dots, X_N^N(k))$ at time k: 

$$M^N(k) = \frac{1}{N} \sum_{n=1}^{N} \delta_{X_n^N(k)}, \tag{1}$$

where $\delta_x$ denotes the Dirac measure in x. $M^N(k)$ is a probability measure on $\mathcal{S}$ and, for all 
$i \in \mathcal{S}$, its i-th component $M^N(k)[i]$ denotes the proportion of objects in state i at time k (also 
called the occupancy measure): $M^N(k)[i] = \frac{1}{N} \sum_{n=1}^{N} \mathbf{1}_{X_n^N(k) = i}$. 
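
As a minimal illustration of Eq. (1), the sketch below computes the occupancy measure from a list of object states. The function name `occupancy_measure` and the sample data are hypothetical and only serve the example; states are labeled 1..S as in the text.

```python
from collections import Counter

def occupancy_measure(states, num_states):
    """Return [M[1], ..., M[S]], where M[i] is the fraction of objects in state i."""
    n = len(states)
    counts = Counter(states)
    return [counts.get(i, 0) / n for i in range(1, num_states + 1)]

# Hypothetical example: N = 5 objects, S = 3 states.
x = [1, 3, 3, 2, 3]
m = occupancy_measure(x, 3)
print(m)  # [0.2, 0.2, 0.6]

# Relabeling the objects (here: reordering the list) leaves the measure unchanged.
assert occupancy_measure(sorted(x), 3) == m
```

The last line reflects the point made next in the text: the occupancy measure only sees how many objects are in each state, not which objects they are.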

The system $(X^N(k))_{k \in \mathbb{N}}$ is a Markov process once the sequence of the actions taken by the 
controller is fixed. Let $\Gamma^N$ be the transition kernel, namely $\Gamma^N$ is a mapping 
$\mathcal{S}^N \times \mathcal{S}^N \times \mathcal{A} \to [0, 1]$, where $\mathcal{A}$ is the set of possible actions, such that for every 
$x \in \mathcal{S}^N$ and $a \in \mathcal{A}$, $\Gamma^N(x, \cdot, a)$ is a probability distribution on $\mathcal{S}^N$ and further, if the 
controller takes the action $A^N(k)$ at time k and the system is in state $X^N(k)$, then: 

$$P\left(X^N(k+1) = y_1 \dots y_N \mid X^N(k) = x_1 \dots x_N,\ A^N(k) = a\right) = \Gamma^N(x_1 \dots x_N,\ y_1 \dots y_N,\ a) \tag{2}$$

We assume that 

(A0) Objects are observable only through their states 

in particular, the controller can observe the collection of all states $X_1^N, X_2^N, \dots$, but not the 
identities n = 1, 2, .... This assumption is required for mean field convergence to occur. In practice, 
it means that we need to put into the object state any information that is relevant to the description 
of the system. 

Assumption (A0) translates into the requirement that the kernel be invariant by object 
relabeling. Formally, let $\mathfrak{S}_N$ be the set of permutations of $\{1, 2, \dots, N\}$. By a slight abuse of 
notation, for $\sigma \in \mathfrak{S}_N$ and $x \in \mathcal{S}^N$ we also denote with $\sigma(x)$ the collection of object states after 
the permutation, i.e. $\sigma(x) = (x_{\sigma^{-1}(1)}, \dots, x_{\sigma^{-1}(N)})$. The requirement is that 

$$\Gamma^N(\sigma(x), \sigma(y), a) = \Gamma^N(x, y, a) \tag{3}$$

for all $x, y \in \mathcal{S}^N$, $\sigma \in \mathfrak{S}_N$ and $a \in \mathcal{A}$. A direct consequence, shown in Section 5, is: 

Theorem 1. For any given sequence of actions, the process $M^N(k)$ is a Markov chain. 



2.2 Action, Reward and Policy 

At every time k, a centralized controller chooses an action $A^N(k) \in \mathcal{A}$ where $\mathcal{A}$ is called the 
action set. $(\mathcal{A}, d)$ is a compact metric space for some distance d. The purpose of Markov decision 
control is to compute optimal policies. A policy $\pi = (\pi_0, \pi_1, \dots, \pi_k, \dots)$ is a sequence of decision 
rules that specify the action at every time instant. The policy $\pi_k$ might depend on the sequence of 
past and present states of the process $X^N$. However, it is known that when the state space is finite, 
the action set compact and the kernel and the reward are continuous, there exists a deterministic 
Markovian policy which is optimal (see Theorem 4.4.3 in [21]). This implies that we can limit 
ourselves to policies that depend only on the current state $X^N(k)$. 

Further, we assume that the controller can only observe object states. Therefore she cannot 
distinguish between states that result from object relabeling, i.e. the policy depends on 
$X^N(k)$ in a way that is invariant by permutation. By Lemma 2 in Section 5.2, it depends on 
$M^N(k)$ only. Thus, we may assume that, for every k, $\pi_k$ is a function $\mathcal{P}(\mathcal{S}) \to \mathcal{A}$. Let $M_\pi^N(k)$ 
denote the occupancy measure of the system at time k when the controller applies policy $\pi$. 

If the system has occupancy measure $M^N(k)$ at time k and if the controller chooses the action 
$A^N(k)$, she gets an instantaneous reward $r^N(M^N(k), A^N(k))$. The expected value over a finite-time 
horizon $[0; H^N]$ starting from m when applying the policy $\pi$ is defined by 

$$V_\pi^N(m) = E\left[\sum_{k=0}^{H^N} r^N\left(M_\pi^N(k), \pi_k(M_\pi^N(k))\right) \,\middle|\, M_\pi^N(0) = m\right] \tag{4}$$

The goal of the controller is to find an optimal policy that maximizes the expected value. We 
denote by $V_*^N(m)$ the optimal value when starting from m: 

$$V_*^N(m) = \sup_\pi V_\pi^N(m) \tag{5}$$

2.3 Scaling Assumptions 

If at some time k, the system has occupancy measure $M^N(k) = m$ and the controller chooses 
action $A^N(k) = a$, the system goes into state $M^N(k+1)$ with probabilities given by the kernel 
$Q^N(M^N(k), A^N(k))$. The expectation of the difference between $M^N(k+1)$ and $M^N(k)$ is called 
the drift and is denoted by $F^N(m, a)$: 

$$F^N(m, a) = E\left[M^N(k+1) - M^N(k) \,\middle|\, M^N(k) = m,\ A^N(k) = a\right]. \tag{6}$$

In order to study the limit with N, we assume that $F^N$ goes to 0 at speed I(N) when N goes 
to infinity and that $F^N / I(N)$ converges to a Lipschitz continuous function f. More precisely, 
we assume that there exists a sequence $I(N) \in (0; 1)$, N = 1, 2, 3, ..., called the intensity of the 
model, with $\lim_{N\to\infty} I(N) = 0$, and a sequence $I_0(N)$, N = 1, 2, 3, ..., also with $\lim_{N\to\infty} I_0(N) = 0$, 
such that for all $m \in \mathcal{P}(\mathcal{S})$ and $a \in \mathcal{A}$: $\left\| \frac{1}{I(N)} F^N(m, a) - f(m, a) \right\| \le I_0(N)$. In a sense, I(N) 
represents the order of magnitude of the number of objects that change their state within one unit 
of time. 

The change of $M^N(k)$ during a time step is of order I(N). This suggests a rescaling of time by 
I(N) to obtain an asymptotic result. We define the continuous time process $(\hat{M}^N(t))_{t \in \mathbb{R}_+}$ as the 
affine interpolation of $M^N(k)$, rescaled by the intensity function, i.e. $\hat{M}^N$ is affine on the intervals 
$[kI(N), (k+1)I(N)]$, $k \in \mathbb{N}$, and 

$$\hat{M}^N(kI(N)) = M^N(k).$$

Similarly, $\hat{M}_\pi^N$ denotes the affine interpolation of the occupancy measure under policy $\pi$. Thus, 
I(N) can also be interpreted as the duration of the time slot for the system with N objects. 
We assume that the time horizon and the reward per time slot scale accordingly, i.e. we impose 

$$H^N = \left\lfloor \frac{T}{I(N)} \right\rfloor, \qquad r^N(m, a) = I(N)\, r(m, a)$$

for every $m \in \mathcal{P}(\mathcal{S})$ and $a \in \mathcal{A}$ (where $\lfloor x \rfloor$ denotes the largest integer $\le x$). 

2.4 Limiting System (Mean Field Limit) 

We will see in Section 3 that as N grows, the stochastic system $\hat{M}^N$ converges to a deterministic 
limit m, the mean field limit. For more clarity, all the stochastic variables (i.e., when N is finite) 
are in uppercase and their limiting deterministic values are in lowercase. 

An action function $\alpha : [0; T] \to \mathcal{A}$ is a piecewise Lipschitz continuous function that associates 
to each time t an action $\alpha(t)$. Note that action functions and policies are different in the sense that 
action functions do not take into account the state to determine the next action. For an action 
function $\alpha$ and an initial condition $m_0$, we consider the following ordinary integral equation for 
m(t), $t \in \mathbb{R}_+$: 

$$m(t) - m(0) = \int_0^t f(m(s), \alpha(s))\, ds. \tag{7}$$

(This equation is equivalent to an ODE, but is easier to manipulate in integral form. In the rest of 
the paper, we make a slight abuse of language and refer to it as an ODE.) Under the foregoing 
assumptions on f and $\alpha$, this equation satisfies the Cauchy-Lipschitz condition and therefore has a 
unique solution once the initial condition $m(0) = m_0$ is fixed. We call $\phi_t$, $t \in \mathbb{R}_+$, the corresponding 
semi-flow, i.e. 

$$m(t) = \phi_t(m_0, \alpha) \tag{8}$$

is the unique solution of Eq. (7). 
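
To make the semi-flow concrete, here is a minimal numerical sketch that approximates the solution of Eq. (7) with explicit Euler steps. The two-state drift `f`, the action function `alpha` and the step size are illustrative assumptions, not taken from the paper; any ODE integrator could be substituted.

```python
def semi_flow(f, alpha, m0, t_end, dt=1e-3):
    """Approximate phi_t(m0, alpha) at t = t_end by explicit Euler steps."""
    m = list(m0)
    steps = int(round(t_end / dt))
    for k in range(steps):
        a = alpha(k * dt)                  # action prescribed at time k*dt
        drift = f(m, a)
        m = [mi + dt * di for mi, di in zip(m, drift)]
    return m

# Hypothetical two-state drift: mass flows 1 -> 2 at rate a and 2 -> 1 at rate 1.
def f(m, a):
    flow = a * m[0] - m[1]
    return [-flow, flow]

# Hypothetical piecewise constant action function with one discontinuity at t = 1.
def alpha(t):
    return 2.0 if t < 1.0 else 0.5

m_T = semi_flow(f, alpha, [1.0, 0.0], t_end=2.0)
print(m_T)
```

Because the components of the drift sum to zero, the approximate trajectory stays on the simplex up to rounding, which is a quick sanity check for any drift written in this form.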

As for the system with N objects, we define $v_\alpha(m_0)$ as the value of the limiting system over a 
finite horizon [0; T] when applying the action function $\alpha$ and starting from $m(0) = m_0$: 

$$v_\alpha(m_0) = \int_0^T r\left(\phi_s(m_0, \alpha), \alpha(s)\right) ds. \tag{9}$$

This equation looks similar to the stochastic case (4), although there are two main differences. The 
first is that the system is deterministic. The second is that it is defined for action functions and 
not for policies. We also define the optimal value of the deterministic limit $v_*(m_0)$: 

$$v_*(m_0) = \sup_\alpha v_\alpha(m_0), \tag{10}$$

where the supremum is taken over all possible action functions from [0; T] to $\mathcal{A}$. 
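
The value of an action function can be approximated in the same spirit. The sketch below accumulates a left Riemann sum of the reward along an Euler approximation of the trajectory; the drift `f`, the reward `r` and the two constant action functions are hypothetical choices made only for illustration.

```python
def limit_value(f, r, alpha, m0, horizon, dt=1e-3):
    """Approximate v_alpha(m0): the integral over [0, T] of r(phi_s(m0, alpha), alpha(s))."""
    m = list(m0)
    value = 0.0
    for k in range(int(round(horizon / dt))):
        a = alpha(k * dt)
        value += dt * r(m, a)                              # left Riemann sum of the reward
        m = [mi + dt * di for mi, di in zip(m, f(m, a))]   # Euler step of the ODE
    return value

# Hypothetical drift: mass flows 1 -> 2 at rate a and 2 -> 1 at rate 1.
def f(m, a):
    flow = a * m[0] - m[1]
    return [-flow, flow]

# Hypothetical reward: fraction of objects in state 2.
def r(m, a):
    return m[1]

v_high = limit_value(f, r, lambda t: 2.0, [1.0, 0.0], horizon=2.0)
v_zero = limit_value(f, r, lambda t: 0.0, [1.0, 0.0], horizon=2.0)
print(v_high, v_zero)
```

Comparing a handful of candidate action functions this way gives lower bounds on the optimal value of the limit; computing the supremum itself requires an optimal control method such as the HJB equation of Section 4.1.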
2.5 Table of Notations 

We recall here a list of the main notations used throughout the paper. 

$M_\pi^N(k)$ : Empirical measure of the system with N objects, under $\pi$, at time k (Section 2.2) 
$F^N(m, a)$ : Drift of the system with N objects when the state is m and the action is a, Eq. (6) 
$f(m, a)$ : Drift of the limiting system (limit of rescaled $F^N(m, a)$ as $N \to \infty$), Eq. (11) 
$\phi_t(m_0, \alpha)$ : State of the limiting system: $\phi_t(m_0, \alpha) = m_0 + \int_0^t f(\phi_s(m_0, \alpha), \alpha(s))\, ds$, Eq. (8) 
$\pi$ : Policy for the system with N objects: associates an action $a \in \mathcal{A}$ to each k, $M^N(k)$ 
$\alpha$ : Action function for the limiting system: associates an action to each t: $\alpha : [0; T] \to \mathcal{A}$ 
$\pi_*^N$ : Optimal policy for the system with N objects 
$\alpha_*$ : Optimal action function for the limiting system (if it exists) 
$V_\pi^N(m)$ : Expected reward for the system with N objects starting from m under policy $\pi$, Eq. (4) 
$V_*^N(m)$ : Optimal expected value for the system N: $V_*^N(m) = \sup_\pi V_\pi^N(m) = V_{\pi_*^N}^N(m)$, Eq. (5) 
$V_\alpha^N(m)$ : Expected value for the system N when applying the action function $\alpha$, Eq. (12) 
$v_\alpha(m)$ : Value of the limiting system starting from m under action function $\alpha$, Eq. (9) 
$v_*(m)$ : Optimal value of the limiting system: $v_*(m) = \sup_\alpha v_\alpha(m) = v_{\alpha_*}(m)$, Eq. (10) 



2.6 Summary of Assumptions 

In Section 3 we establish theorems for the convergence of the discrete stochastic optimization 
problem to a continuous deterministic one. These theorems are based on several technical 
assumptions, which are given next. Since $\mathcal{S}$ is finite, the set $\mathcal{P}(\mathcal{S})$ is the simplex in $\mathbb{R}^S$ and for 
$m, m' \in \mathcal{P}(\mathcal{S})$ we define $\|m\|$ as the $\ell^2$-norm of m and $\langle m, m' \rangle = \sum_{i=1}^{S} m_i m'_i$ as the usual inner 
product. 

(A1) (Transition probabilities) Objects can be observed only through their state, i.e., the 
transition probability matrix (or transition kernel) $\Gamma^N$, defined by Eq. (2), is invariant under 
permutations of 1 ... N. 

There exist some non-random functions $I_1(N)$ and $I_2(N)$ such that 
$\lim_{N\to\infty} I_1(N) = \lim_{N\to\infty} I_2(N) = 0$ and such that for all m and any policy $\pi$, the number of 
objects that perform a transition between time slot k and k + 1 per time slot, $\Delta_\pi^N(k)$, satisfies 

$$E\left(\Delta_\pi^N(k) \mid M_\pi^N(k) = m\right) \le N I_1(N)$$
$$E\left(\Delta_\pi^N(k)^2 \mid M_\pi^N(k) = m\right) \le N^2 I(N) I_2(N)$$

where I(N) is the intensity function of the model, defined in the following assumption A2. 

(A2) (Convergence of the Drift) There exist some non-random functions I(N) and $I_0(N)$ 
and a function f(m, a) such that $\lim_{N\to\infty} I(N) = \lim_{N\to\infty} I_0(N) = 0$ and 

$$\left\| \frac{1}{I(N)} F^N(m, a) - f(m, a) \right\| \le I_0(N) \tag{11}$$

f is defined on $\mathcal{P}(\mathcal{S}) \times \mathcal{A}$ and there exists $L_2$ such that $|f(m, a)| \le L_2$. 

(A3) (Lipschitz Continuity) There exist constants $L_1$, K and $K_r$ such that for all 
$m, m' \in \mathcal{P}(\mathcal{S})$, $a, a' \in \mathcal{A}$: 

$$\|F^N(m, a) - F^N(m', a)\| \le L_1 \|m - m'\| I(N)$$
$$\|f(m, a) - f(m', a')\| \le K\left(\|m - m'\| + d(a, a')\right)$$
$$|r(m, a) - r(m', a)| \le K_r \|m - m'\|$$

We also assume that the reward is bounded: $\sup_{m, a} |r(m, a)| =: \|r\|_\infty < \infty$. 

To make things more concrete, here is a simple but useful case where all assumptions are true. 

• There are constants $c_1$ and $c_2$ such that the expectation of the number of objects that perform 
a transition in one time slot is $\le c_1$ and its standard deviation is $\le c_2$, 

• and $F^N(m, a)$ can be written under the form $\frac{1}{N} \varphi(m, a, 1/N)$ where $\varphi$ is a continuous 
function on $\Delta_S \times \mathcal{A} \times [0, \epsilon)$ for some neighborhood $\Delta_S$ of $\mathcal{P}(\mathcal{S})$ and some $\epsilon > 0$, continuously 
differentiable with respect to m. 

In this case we can choose I(N) = 1/N, $I_0(N) = c_0/N$ (where $c_0$ is an upper bound to the norm 
of the differential), $I_1(N) = c_1/N$ and $I_2(N) = (c_1^2 + c_2^2)/N$. 
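
As a quick check of the two moment conditions in (A1) for this choice, write $\Delta$ for the number of objects that perform a transition in one time slot (this shorthand is introduced here only for the computation). With I(N) = 1/N, $I_1(N) = c_1/N$ and $I_2(N) = (c_1^2 + c_2^2)/N$:

```latex
\begin{align*}
E(\Delta) &\le c_1 = N \cdot \frac{c_1}{N} = N\, I_1(N), \\
E(\Delta^2) &= \operatorname{Var}(\Delta) + E(\Delta)^2 \le c_2^2 + c_1^2
  = N^2 \cdot \frac{1}{N} \cdot \frac{c_1^2 + c_2^2}{N} = N^2\, I(N)\, I_2(N).
\end{align*}
```

Both $I_1(N)$ and $I_2(N)$ go to 0 as N grows, as required.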



3 Mean Field Convergence 



In Section 3.1 we establish the main results; then, in Section 3.2, we provide the details of the 
method used to derive them. 
3.1 Main Results 

The first result establishes convergence of the optimization problem for the system with N objects 
to the optimization problem of the mean field limit: 

Theorem 2 (Optimal System Convergence). Assume (A0) to (A3). If $\lim_{N\to\infty} M^N(0) = m_0$ 
almost surely [resp. in probability] then: 

$$\lim_{N\to\infty} V_*^N(M^N(0)) = v_*(m_0)$$

almost surely [resp. in probability], where $V_*^N$ and $v_*$ are the optimal values for the system with N 
objects and the mean field limit, defined in Section 2. 

The proof is given in Section 5.6. 

The second result states that an optimal action function for the mean field limit provides an 
asymptotically optimal strategy for the system with N objects. We need, at this point, to introduce 
a first auxiliary system, which is a system with N objects controlled by an action function borrowed 
from the mean field limit. More precisely, let $\alpha$ be an action function that specifies the action to 
be taken at time t. Although $\alpha$ has been defined for the limiting system, it can also be used in the 
system with N objects. In this case, the action function $\alpha$ can be seen as a policy that does not 
depend on the state of the system. At step k, the controller applies action $\alpha(kI(N))$. By abuse of 
notation, we denote by $M_\alpha^N$ the state of the system when applying the action function $\alpha$ (it will 
be clear from the notation whether the subscript is an action function or a policy). The value for 
this system is defined by 

$$V_\alpha^N(m_0) = E\left[\sum_{k=0}^{H^N} r^N\left(M_\alpha^N(k), \alpha(kI(N))\right) \,\middle|\, M_\alpha^N(0) = m_0\right] \tag{12}$$

Our next result is the convergence of $M_\alpha^N$ and of the value: 

Theorem 3. Assume (A0) to (A3); $\alpha$ is a piecewise Lipschitz continuous action function on [0; T], 
of constant $K_\alpha$, and with at most p discontinuity points. Let $\hat{M}_\alpha^N(t)$ be the linear interpolation of 
the discrete time process $M_\alpha^N$. Then for all $\epsilon > 0$: 

$$P\left( \sup_{0 \le t \le T} \left\| \hat{M}_\alpha^N(t) - \phi_t(m_0, \alpha) \right\| > \left[ \|M^N(0) - m_0\| + I_0'(N, \alpha) T + \epsilon \right] e^{L_1 T} \right) \le \frac{J(N, T)}{\epsilon^2} \tag{13}$$

and 

$$\left| V_\alpha^N(M^N(0)) - v_\alpha(m_0) \right| \le B'\left(N, \|M^N(0) - m_0\|\right) \tag{14}$$

where J, $I_0'$ and B' are defined in Section 5.1 and satisfy $\lim_{N\to\infty} I_0'(N, \alpha) = \lim_{N\to\infty} J(N, T) = 0$ 
and $\lim_{N\to\infty, \delta\to 0} B'(N, \delta) = 0$. 

In particular, if $\lim_{N\to\infty} M^N(0) = m_0$ almost surely [resp. in probability] then 
$\lim_{N\to\infty} V_\alpha^N(M^N(0)) = v_\alpha(m_0)$ almost surely [resp. in probability]. 

The proof is given in Section 5.5. 

As the reward function r(m, a) is bounded and the time-horizon [0; T] is finite, the set of values 
when starting from the initial condition m, $\{v_\alpha(m) : \alpha \text{ action function}\}$, is bounded. This set is 
not necessarily compact because the set of action functions may not be closed (a limit of Lipschitz 
continuous functions is not necessarily Lipschitz continuous). However, as it is bounded, for all 
$\epsilon > 0$, there exists an action function $\alpha^\epsilon$ such that $v_*(m) = \sup_\alpha v_\alpha(m) \le v_{\alpha^\epsilon}(m) + \epsilon$. Theorem 2 
shows that $\alpha^\epsilon$ is optimal up to $2\epsilon$ for N large enough. This shows the following corollary: 

Corollary 4 (Asymptotically Optimal Policy). Let $\alpha_*$ be an optimal action function for the 
limiting system. Then 

$$\lim_{N\to\infty} \left| V_{\alpha_*}^N - V_*^N \right| = 0.$$

In other words, an optimal action function for the limiting system is asymptotically optimal for the 
system with N objects. 

In particular, this shows that as N grows, policies that do not take into account the state of 
the system (i.e., action functions) are asymptotically as good as adaptive policies. In practice, 
adaptive policies might perform better, especially for very small values of N. It is, however, 
in general impossible to prove convergence for adaptive policies. 



3.2 Derivation of Main Results 
3.2.1 Second Auxiliary System 

The method of proof uses a second auxiliary system, the process $\phi_t(m_0, A_\pi^N)$ defined below. It is a 
limiting system controlled by an action function derived from the policy of the original system 
with N objects. 

Consider the system with N objects under policy $\pi$. The process $M_\pi^N$ is defined on some 
probability space $\Omega$. To each $\omega \in \Omega$ corresponds a trajectory $M_\pi^N(\omega)$, and for each $\omega \in \Omega$, we 
define an action function $A_\pi^N(\omega)$. This random function is piecewise constant on each interval 
$[kI(N), (k+1)I(N))$ ($k \in \mathbb{N}$) and is such that $A_\pi^N(\omega)(kI(N)) = \pi_k(M_\pi^N(k))$ is the action taken 
by the controller of the system with N objects at time slot k, under policy $\pi$. 

Recall that for any $m_0 \in \mathcal{P}(\mathcal{S})$ and any action function $\alpha$, $\phi_t(m_0, \alpha)$ is the solution of the 
ODE (7). For every $\omega$, $\phi_t(m_0, A_\pi^N(\omega))$ is the solution of the limiting system with action function 
$A_\pi^N(\omega)$, i.e. 

$$\phi_t(m_0, A_\pi^N(\omega)) - m_0 = \int_0^t f\left(\phi_s(m_0, A_\pi^N(\omega)), A_\pi^N(\omega)(s)\right) ds.$$

When $\omega$ is fixed, $\phi_t(m_0, A_\pi^N(\omega))$ is a continuous time deterministic process corresponding to 
one trajectory $M_\pi^N(\omega)$. When considering all possible realizations of $M_\pi^N$, $\phi_t(m_0, A_\pi^N)$ is a random, 
continuous time function "coupled" to $M_\pi^N$. Its randomness comes only from the action term $A_\pi^N$ 
in the ODE. In the following, we omit the dependence on $\omega$. $A_\pi^N$ and $M_\pi^N$ will always 
designate the processes corresponding to the same $\omega$. 



3.2.2 Convergence of Controlled System 



The following result is the main technical result; it shows the convergence of the controlled system 
in probability, with explicit bounds. Notice that it does not require any regularity assumption on 
the policy $\pi$. 

Theorem 5. Under Assumptions (A0) to (A3), for any $\epsilon > 0$, $N \ge 1$ and any policy $\pi$: 

$$P\left( \sup_{0 \le t \le T} \left\| \hat{M}_\pi^N(t) - \phi_t(m_0, A_\pi^N) \right\| > \left[ \|M^N(0) - m_0\| + I_0(N) T + \epsilon \right] e^{L_1 T} \right) \le \frac{J(N, T)}{\epsilon^2} \tag{15}$$

where $\hat{M}_\pi^N$ is the linear interpolation of the discrete time system with N objects and J is defined 
in Section 5.1. 

Recall that $I_0(N)$ and J(N, T) for a fixed T go to 0 as $N \to \infty$. The proof is given in Section 5.3. 



3.2.3 Convergence of Value 

Let $\pi$ be a policy and $A_\pi^N$ the sequence of actions corresponding to a trajectory $M_\pi^N$ as we just 
defined. Eq. (9) defines the value for the deterministic limit when applying a sequence of actions. 
This defines a random variable $v_{A_\pi^N}(m_0)$ that corresponds to the value over the limit system when 
using $A_\pi^N$ as action function. The random part comes from $A_\pi^N$. $E[v_{A_\pi^N}(m_0)]$ designates the 
expectation of this value over all possible $A_\pi^N$. A first consequence of Theorem 5 is the convergence 
of $V_\pi^N(M^N(0))$ to $E[v_{A_\pi^N}(m_0)]$ with an error that can be uniformly bounded. 

Theorem 6 (Uniform convergence of the value). Let $A_\pi^N$ be the random action function associated 
with $M_\pi^N$, as defined earlier. Under Assumptions (A0) to (A3), 

$$\left| V_\pi^N(M^N(0)) - E\left[v_{A_\pi^N}(m_0)\right] \right| \le B\left(N, \|M^N(0) - m_0\|\right)$$

where B is defined in Section 5.1. 

Note that $\lim_{N\to\infty, \delta\to 0} B(N, \delta) = 0$; in particular, if $\lim_{N\to\infty} M^N(0) = m_0$ almost surely [resp. 
in probability] then $\left| V_\pi^N(M^N(0)) - E[v_{A_\pi^N}(m_0)] \right| \to 0$ almost surely [resp. in probability]. 

The proof is given in Section 5.4. 



3.2.4 Putting Things Together 

The proof of the main result uses the two auxiliary systems. The first auxiliary system provides a 
strategy for the system with N objects derived from an action function of the mean field limit; it 
cannot do better than the optimal value for the system with N objects, and is close to the optimal 
value of the mean field limit. Therefore, the optimal value for the system with N objects is lower 
bounded by the optimal value for the mean field limit. The second auxiliary system is used in the 
opposite direction, which shows that, roughly speaking, for large N the two optimal values are the 
same. We give the details of the derivation in Section 5.6. 
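
In symbols, and only as a sketch of the argument (assuming $M^N(0) \to m_0$ and using the statements of Theorems 3 and 6 above), the two directions read:

```latex
\begin{align*}
\liminf_{N\to\infty} V_*^N(M^N(0))
  &\ge \lim_{N\to\infty} V_\alpha^N(M^N(0)) = v_\alpha(m_0)
  && \text{for every action function } \alpha, \\
V_\pi^N(M^N(0))
  &\le E\big[v_{A_\pi^N}(m_0)\big] + B\big(N, \|M^N(0) - m_0\|\big)
  \le v_*(m_0) + B\big(N, \|M^N(0) - m_0\|\big)
  && \text{for every policy } \pi.
\end{align*}
```

Taking the supremum over action functions in the first line and over policies in the second gives $v_*(m_0) \le \liminf_N V_*^N(M^N(0)) \le \limsup_N V_*^N(M^N(0)) \le v_*(m_0)$. The second line uses that each realization of the random action function is piecewise constant on [0; T], hence an admissible action function whose value cannot exceed the optimal value of the limit.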
4 Applications 

4.1 Hamilton-Jacobi-Bellman Equation and Dynamic Programming 

Let us now consider the finite time optimization problem for the stochastic system and its limit from 
a constructive point of view. As the state space is finite, we can compute the optimal value by using 
a dynamic programming algorithm. If $U^N(m, t)$ denotes the optimal value for the stochastic system 
starting from m at time t/I(N), then 
$U^N(m, t) = \sup_\pi E\left[ \sum_{k = t/I(N)}^{T/I(N)} r^N(M_\pi^N(k)) \,\middle|\, \hat{M}^N(t) = m \right]$. 
The optimal value can be computed by a discrete dynamic programming algorithm [21] by setting 
$U^N(m, T) = r^N(m)$ and 

$$U^N(m, t) = \sup_{a \in \mathcal{A}} E\left( r^N(m, a) + U^N\left(\hat{M}^N(t + I(N)), t + I(N)\right) \,\middle|\, \hat{M}^N(t) = m,\ A^N(t) = a \right). \tag{16}$$

Then, the optimal cost over horizon [0; T/I(N)] is $V_*^N(m) = U^N(m, 0)$. 
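
The backward recursion can be sketched on a toy model. Everything specific below is an assumption made for illustration: two states, intensity I(N) = 1/N, a kernel in which at most one object moves per step, and a terminal value taken as the best one-step reward. The state is summarized by the number i of objects in state 1, so m = i/N.

```python
def optimal_value(N, T, actions, p_up, q_down, reward):
    """Backward induction; returns U with U[i] = optimal value from m = i/N at time 0."""
    H = int(T * N)                                   # number of steps, with I(N) = 1/N

    def r_N(i, a):
        return reward(i / N, a) / N                  # r^N(m, a) = I(N) r(m, a)

    U = [max(r_N(i, a) for a in actions) for i in range(N + 1)]   # terminal step
    for _ in range(H):
        new_U = []
        for i in range(N + 1):
            best = float("-inf")
            for a in actions:
                up = (1 - i / N) * p_up(a)           # one object moves from state 0 to 1
                down = (i / N) * q_down              # one object moves from state 1 to 0
                cont = (1 - up - down) * U[i]
                if i < N:
                    cont += up * U[i + 1]
                if i > 0:
                    cont += down * U[i - 1]
                best = max(best, r_N(i, a) + cont)
            new_U.append(best)
        U = new_U
    return U

# Hypothetical instance: action 1 speeds up 0 -> 1 transitions at a cost of 0.3 per unit time.
U = optimal_value(N=10, T=1.0, actions=[0, 1],
                  p_up=lambda a: 0.2 + 0.6 * a, q_down=0.2,
                  reward=lambda m, a: m - 0.3 * a)
print(U[0], U[-1])
```

Here the occupancy measure takes only N + 1 values because there are two states. With S states the number of occupancy measures grows like $N^{S-1}$, which is what makes this recursion costly for large systems.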

Similarly, if we denote by u(m, t) the optimal cost over horizon [t; T] for the limiting system, 
u(m, t) satisfies the classical Hamilton-Jacobi-Bellman equation: 

$$\frac{\partial u}{\partial t}(m, t) + \max_{a \in \mathcal{A}} \left\{ \nabla u(m, t) \cdot f(m, a) + r(m, a) \right\} = 0. \tag{17}$$

This provides a way to compute the optimal value, as well as the optimal policy, by solving the 
partial differential equation above. 

4.2 Algorithms 

Theorem 2 above can be used to design an effective construction of an asymptotically optimal 
policy for the system with N objects over the horizon [0, H] by using the procedure described in 
Algorithm 1. 

Algorithm 1: Static algorithm constructing a policy for the system with N objects, over the 
finite horizon. 

begin 
    From the original system with N objects, construct the occupancy measure $M^N$ and its 
    kernel $\Gamma^N$ and let $M^N(0)$ be the initial occupancy measure; 
    Compute the limit of the drift of $\Gamma^N$, namely the function f; 
    Solve the HJB equation (17) on the interval [0, HI(N)]. This provides an optimal control 
    function $\alpha(M^N(0), t)$; 
    Construct a discrete control $\pi(M^N(k), k)$ for the discrete system, that gives the action to 
    be taken under state $M^N(k)$ at step k: 
        $\pi(M^N(k), k) := \alpha(\phi_{kI(N)}(M^N(0), \alpha))$; 
    return $\pi$; 
end 
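
The last two steps of Algorithm 1 can be sketched as follows, assuming the HJB solution is already available as a feedback law. The names `feedback`, `f` and `static_policy`, the intensity I(N) = 1/N and the use of one Euler step per time slot to approximate the mean field trajectory are all assumptions of this sketch, not part of the algorithm as stated.

```python
def static_policy(feedback, f, m0, N, T):
    """Actions pi(k) = feedback(m(k I(N)), k I(N)) along the deterministic mean field path."""
    dt = 1.0 / N                         # assumed intensity I(N) = 1/N
    H = int(T * N)                       # number of decision steps
    m, actions = list(m0), []
    for k in range(H):
        a = feedback(m, k * dt)          # action read off the deterministic estimate of M(k)
        actions.append(a)
        m = [mi + dt * di for mi, di in zip(m, f(m, a))]   # one Euler step of size I(N)
    return actions

# Hypothetical drift: mass flows 1 -> 2 at rate a and 2 -> 1 at rate 1.
def f(m, a):
    flow = a * m[0] - m[1]
    return [-flow, flow]

# Hypothetical feedback law standing in for the output of an HJB solver.
def feedback(m, t):
    return 1 if m[1] < 0.4 else 0

pi = static_policy(feedback, f, [1.0, 0.0], N=10, T=1.0)
print(pi)  # [1, 1, 1, 1, 1, 1, 1, 1, 0, 1]
```

The returned list is computed before the system runs: the action at step k never looks at the realized occupancy measure, only at its deterministic estimate.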



Theorem 2 says that under policy $\pi$, the total value is asymptotically optimal: 

$$\lim_{N\to\infty} V_\pi^N(M^N(0)) = \lim_{N\to\infty} V_*^N(M^N(0)).$$

The policy $\pi$ constructed by Algorithm 1 is static in the sense that it does not depend on 
the state $M^N(k)$ but only on the initial state $M^N(0)$, and the deterministic estimation of $M^N(k)$ 
provided by the differential equation. One can construct a more adaptive policy by updating 
the starting point of the differential equation at each step. This new procedure, constructing an 
adaptive policy $\pi'$ from 0 to the final horizon H, is given in Algorithm 2. 

In practice, the total value of the adaptive policy $\pi'$ is larger than the value of the static policy 
$\pi$ because it uses on-line corrections at each step, before taking a new action. However, Theorem 2 
does not provide a proof of its asymptotic optimality. 



RR n° 7239 






N.Gast & B. Gaujal & J.-Y. Le Boudec 



Algorithm 2: Adaptive algorithm constructing a policy $\pi'$ for the system with $N$ objects,
over the finite horizon $H$.

begin
    $M := M^N(0)$; $k := 0$;
    repeat
        $\alpha_k(M, \cdot)$ := solution of (17) over $[kI(N), HI(N)]$ starting in $M$;
        $\pi'(M, k) := \alpha_k(\phi_{kI(N)}(M, \alpha_k))$;
        $M$ is changed by applying kernel $\Gamma^N$;
        $k := k+1$;
    until $k = H$;
    return $\pi'$;
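
The difference between the static and the adaptive constructions can be sketched in a few lines of Python. This is only an illustrative sketch under assumptions that are not in the text: the occupancy measure is reduced to one scalar, the limiting optimal control is assumed to be already available as a feedback rule `feedback(t, m)` (a hypothetical stand-in for the HJB solution), and the limiting ODE is integrated with an explicit Euler step of size $I(N)$. The toy drift and the threshold rule at the bottom are hypothetical as well.

```python
from typing import Callable

Feedback = Callable[[float, float], float]  # (time, state) -> action
Drift = Callable[[float, float], float]     # (state, action) -> dm/dt


def static_policy(m0: float, feedback: Feedback, drift: Drift,
                  horizon: int, intensity: float):
    """Static construction: actions are read along the ODE started at m0."""
    actions = []
    m = m0
    for k in range(horizon):
        a = feedback(k * intensity, m)
        actions.append(a)
        m = m + intensity * drift(m, a)  # explicit Euler step of the limiting ODE
    # The returned policy ignores the observed state and only uses the step index.
    return lambda observed_state, k: actions[k]


def adaptive_policy(feedback: Feedback, intensity: float):
    """Adaptive construction: the control is re-evaluated at the observed state."""
    return lambda observed_state, k: feedback(k * intensity, observed_state)


# Hypothetical toy model: drift a - m, and a threshold rule standing in for
# the optimal control of the limiting system.
toy_drift = lambda m, a: a - m
toy_feedback = lambda t, m: 1.0 if m < 0.5 else 0.0

pi_static = static_policy(0.0, toy_feedback, toy_drift, horizon=10, intensity=0.1)
pi_adaptive = adaptive_policy(toy_feedback, intensity=0.1)
```

With this sketch, `pi_static` returns the same action whatever state is observed at step `k`, while `pi_adaptive` reacts to the observed state. Evaluating the feedback rule at the current state is used here as a shortcut for re-solving the limiting problem from that state, which is an assumption of the sketch.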



4.3 Examples 

In this section, we develop three examples. The first one can be seen as a simple illustration of 
optimal mean field. The limiting ODE is quite simple and can be optimized in closed analytical 
form. 

The second example considers a classic virus problem. Although virus propagations concern
discrete objects (individuals or devices), most work in the literature studies a continuous
approximation of the problem under the form of an ODE. The justification of passing from a discrete
to a continuous model is barely mentioned in most papers (they mainly focus on the study of
the ODE). Here we present a discrete dynamical system based on a simple stochastic mobility
model for the individuals whose behavior converges to a classic continuous model. We show on a
numerical example that the limiting problem provides a policy that is close to optimal, even for a
system with a relatively small number of nodes.

Finally, the last example comes from routing optimization in a queueing network model of
volunteer computing platforms. The purpose of this last example is to show that a discrete
optimal control problem suffering from the curse of dimensionality can be replaced by a continuous
optimization problem where an HJB equation must be solved over a much smaller state space.

4.3.1 Utility Provider Pricing 

This is a simplified discrete Merton's problem. This example shows a case where the optimization 
problem in the infinite system can be solved in closed form. This can be seen as an ideal case for 
the mean field approach: although the original system is difficult to solve even numerically when 
N is large, taking the limit when N goes to infinity makes it simple to solve, in an analytical form. 

We consider a system made of a utility and $N$ users; users can be either in state S (subscribed)
or U (unsubscribed). The utility fixes its price $\alpha \in [0,1]$. At every time step, one randomly
chosen customer revises her status: if she is in state U [resp. S], with probability $s(\alpha)$ [resp.
$a(\alpha)$] she moves to the other state; $s(\alpha)$ is the probability of a new subscription, and $a(\alpha)$ is the
probability of attrition. We assume $s(\cdot)$ decreases with $\alpha$ and $a(\cdot)$ increases. If the price is large,
the instant gain is large, but the utility loses customers, which eventually reduces the gain.

Within our framework, this problem can be seen as a Markovian system made of $N$ objects
(users) and one controller (the provider). The intensity of the model is $I(N) = 1/N$. Moreover, if
the immediate profit is divided by $N$ (this does not alter the optimal pricing policy) and if $x(t)$ is
the fraction of objects in state S at time $t$ and $\alpha(t) \in [0; 1]$ is the action taken by the provider at
time $t$, the mean field limit of the system is:

$$\frac{dx}{dt} = -x(t)\,a(\alpha(t)) + (1 - x(t))\,s(\alpha(t)) = s(\alpha(t)) - x(t)\big(s(\alpha(t)) + a(\alpha(t))\big) \qquad (18)$$

and the rescaled profit over a time horizon $T$ is $\int_0^T x(t)\alpha(t)\,dt$. Call $u_*(t, x)$ the optimal benefit over
the interval $[t, T]$ if there is a proportion $x$ of subscribers at time $t$. The Hamilton-Jacobi-Bellman
equation is

$$\frac{\partial}{\partial t} u_*(t,x) + H\Big(x, \frac{\partial}{\partial x} u_*(t,x)\Big) = 0$$

with $H(x,p) = \max_{\alpha \in [0,1]} \big[\, p\,\big(s(\alpha) - x(s(\alpha) + a(\alpha))\big) + \alpha x \,\big]$.

$H$ can be computed under reasonable assumptions on the rates of subscription and attrition $s(\cdot)$
and $a(\cdot)$, which can then be used to show that there exists an optimal policy that is threshold based.
To continue the rest of this illustration, we consider the radically simplified case where $\alpha$ can take
only the values 0 and 1 and under the conditions $s(0) = a(1) = 1$ and $s(1) = a(0) = 0$, in which
case the ODE becomes

$$\frac{dx}{dt} = -x(t)\alpha(t) + (1 - x(t))(1 - \alpha(t)) = 1 - x(t) - \alpha(t), \qquad (19)$$

and $H(x,p) = \max\big(x(1 - p), (1 - x)p\big)$. The solution of the HJB equation can be given in closed
form. The optimal policy is to choose action $\alpha = 1$ if $x > 1/2$ or $x > 1 - \exp(-(T - t))$, and $\alpha = 0$
otherwise. Figure 1 shows the evolution of the proportion of subscribers $x(t)$ when the optimal
policy is used. The coloured area corresponds to all the points $(t, x)$ where the optimal policy is
$\alpha = 1$ (fix a high price) and the white area is where the optimal policy is to choose $\alpha = 0$ (low
price).
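
As a quick numerical illustration of the simplified pricing problem, the following Python sketch integrates the ODE (19) with an explicit Euler scheme and accumulates the rescaled profit, i.e. the integral of the subscriber fraction times the price. The horizon `T = 5`, the initial fraction `0.2` and the step size are arbitrary choices made for this sketch, not values from the text; near the line $x = 1/2$ the discretized policy switches back and forth between the two prices, which is an artifact of the time discretization.

```python
import math


def threshold_price(t: float, x: float, T: float) -> int:
    """Threshold rule quoted in the text: high price (1) above the switching curve."""
    return 1 if (x > 0.5 or x > 1.0 - math.exp(-(T - t))) else 0


def simulate(policy, x0: float, T: float, dt: float = 1e-3):
    """Explicit Euler integration of dx/dt = 1 - x - price, with running profit x * price."""
    n_steps = int(round(T / dt))
    x, profit = x0, 0.0
    for k in range(n_steps):
        price = policy(k * dt, x)
        profit += x * price * dt
        x += (1.0 - x - price) * dt
    return x, profit


T = 5.0  # arbitrary horizon for the illustration
_, profit_threshold = simulate(lambda t, x: threshold_price(t, x, T), 0.2, T)
_, profit_high = simulate(lambda t, x: 1, 0.2, T)  # always the high price
_, profit_low = simulate(lambda t, x: 0, 0.2, T)   # always the low price
```

In this sketch the threshold rule first lets the subscriber base grow with the low price, then holds it around one half, and finally harvests with the high price near the end of the horizon, which is why it earns more than either constant price.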




Figure 1: Evolution of the proportion of subscribers (y-axis) under the optimal pricing policy. 



To show that this policy is indeed optimal, one has to compute the corresponding value of the
benefit $u(t, x)$ and show that it satisfies the HJB equation. This can be done using a case analysis,
by computing explicitly the value of $u(t, x)$ in the zones $Z_1$, $Z_2$, $Z_3$ and $Z_4$ displayed in Figure 1
and checking that $u(t,x)$ satisfies the HJB equation in each case.



4.3.2 Infection Strategy of a Viral Worm 

This second example has two purposes. The first one is to provide a rigorous justification of the 
use of a continuous optimization approach for this classic problem in population dynamics and to 
show that the continuous limit provides insights on the structure of the optimal behavior for the 
discrete system. Here, the optimal action function can be shown to be of the bang-bang type for 
the limit problem, by using tools from continuous optimization such as the Pontryagin maximum 
principle. Theorem 2 shows that a bang-bang policy should also be asymptotically optimal in the
discrete case.

The second purpose is to compare numerically the performance of the optimal policy of the
deterministic limit $\alpha^*$ and the performance of other policies for the stochastic system for small
values of $N$. We show that $\alpha^*$ is close to optimal even for $N = 10$ and that it outperforms another
classic heuristic.

This example is taken from [15] and considers the propagation of infection by a viral worm.
Actually, similar epidemic models have been validated through experiments, as well as simulations 
as a realistic representation of the spread of a virus in mobile wireless networks (see [71 [52]). A 
susceptible node is a mobile wireless device, not contaminated by the worm but prone to infection. 
A node is infective if it is contaminated by the worm. An infective node spreads the worm to a 
susceptible node whenever they meet, with probability $\beta$. The worm can also choose to kill an
infective node, i.e., render it completely dysfunctional - such nodes are denoted dead. A functional 
node that is immune to the worm is referred to as recovered. Although the network operator 
uses security patches to immunize susceptibles (they become recovered) and heals infectives to 
the recovered state, the goal of the worm is to maximize the damages done to the network. Let 
the total number of nodes in the network be $N$. Let the proportion of susceptible, infective,
recovered and dead nodes at time $t$ be denoted by $S(t)$, $I(t)$, $R(t)$ and $D(t)$, respectively. Under a
uniform mobility model, the probability that a susceptible node becomes infected is $\beta I/N$. The
immunization of susceptibles (resp. infectives) happens at a fixed rate $q$ (resp. $b$). This means that
a susceptible (resp. infective) node is immunized with probability $q/N$ (resp. $b/N$) at every time
step.

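At this point, the authors of [15] invoke the classic results of Kurtz [17] to show that the dynamics
of this population process converges to the solution of the following differential equations.

$$\begin{aligned}
\frac{dS}{dt} &= -\beta I S - qS \\
\frac{dI}{dt} &= \beta I S - bI - \nu(t) I \\
\frac{dD}{dt} &= \nu(t) I \\
\frac{dR}{dt} &= bI + qS.
\end{aligned} \qquad (20)$$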



This system actually satisfies assumptions (A1, A2, A3), which allows us not only to obtain the
mean field limit, but also to say more about the optimization problem. The objective of the
worm is to find $\nu(\cdot)$ such that the damage function $D(T) + \int_0^T f(I(t))\,dt$ is maximized under the
constraint $0 \le \nu \le \nu_{\max}$ (where $f$ is convex). In [15], this problem is shown to have a solution and
the Pontryagin maximum principle is used to show that the optimal solution $\nu^*(\cdot)$ is of bang-bang
type:

$$\exists\, t_1 \in [0, T) \ \text{s.t.}\ \nu^*(t) = 0 \ \text{for}\ 0 < t < t_1 \ \text{and}\ \nu^*(t) = \nu_{\max} \ \text{for}\ t_1 < t < T. \qquad (21)$$

Theorem 2 makes the formal link between the optimization of the model on an individual level
and the previous resolution of the optimization problem on the differential equations, done in [15].
It allows us to formally claim that the policy $\alpha^*$ of the worm is indeed asymptotically optimal
when the number of objects goes to infinity.

We investigated numerically the performance of $\alpha^*$ against various infection policies for small
values of the number of nodes in the system $N$. These results are reported in Figure 2, where we
compare four values:

• $v^*$ - the optimal value of the limiting system;

• $V^N_*$ - the optimal expected damage for the system with $N$ objects (MDP problem);

• $V^N_{\alpha^*}$ - the expected value of the system with $N$ objects when applying the action function $\alpha^*$
that is optimal for the limiting system (performance of Algorithm 1);

• the performance of a heuristic where, instead of choosing a threshold as suggested by the
limiting system (21), the killing probability $\nu$ is fixed for the whole time. The curve on the
figure is drawn for the optimal $\nu$ (recomputed for each parameter $N$).

We implemented a simulator that follows strictly the model of infection described earlier in this part.
We chose parameters similar to those used in [12]: the parameters for the evolution of the system are
$\beta = 0.6$, $q = 0.1$, $b = 0.1$, $\nu_{\max} = 1$ and the damage function to be optimized is $D(T) + \frac{1}{T}\int_0^T I^2(t)\,dt$
with $T = 10$. However, it should be noted that the choice of these parameters does not influence
qualitatively the results. Thanks to the relatively small size of the system, these four quantities can
be computed numerically using a backward induction. The optimal policy for the deterministic
limit consists in not killing machines until $t_1 = 4.9$ and in killing machines at a maximum rate
after that time: $\alpha^*(t) = \mathbf{1}_{\{t > 4.9\}}$.
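
The limiting worm dynamics under a bang-bang killing rate can be explored with a short Python sketch. It integrates the four differential equations for the susceptible, infective, recovered and dead fractions with an explicit Euler scheme and switches the killing rate from 0 to its maximum at a time `t1`. Several ingredients are assumptions of this sketch rather than facts from the text: the initial state (`s0 = 0.9`, `i0 = 0.1`), the step size, and the default running cost `I**2 / T`, which can be replaced through the `running_cost` argument.

```python
def simulate_worm(t1, T=10.0, beta=0.6, q=0.1, b=0.1, nu_max=1.0,
                  s0=0.9, i0=0.1, dt=1e-3, running_cost=None):
    """Euler integration of the limiting epidemic ODE with a bang-bang kill rate.

    Returns (S, I, R, D, damage) at time T, where damage is D(T) plus the
    integral of running_cost(I(t)) over [0, T].
    """
    if running_cost is None:
        running_cost = lambda i: i * i / T  # assumed form of the running cost
    n_steps = int(round(T / dt))
    S, I, R, D = s0, i0, 0.0, 0.0  # hypothetical initial state
    integral = 0.0
    for k in range(n_steps):
        nu = nu_max if k * dt >= t1 else 0.0  # no killing before t1, maximal after
        integral += running_cost(I) * dt
        dS = -beta * I * S - q * S
        dI = beta * I * S - b * I - nu * I
        dD = nu * I
        dR = b * I + q * S
        S, I, R, D = S + dt * dS, I + dt * dI, R + dt * dR, D + dt * dD
    return S, I, R, D, D + integral


S, I, R, D, damage = simulate_worm(t1=4.9)
S_nk, I_nk, R_nk, D_nk, damage_nk = simulate_worm(t1=10.0)  # never kills
```

Since the four right-hand sides sum to zero, the total mass `S + I + R + D` stays equal to one along the trajectory, which gives a simple sanity check on any implementation. Scanning `t1` over a grid and keeping the value with the largest damage is one way to look for a switching time numerically, with the caveat that the result depends on the assumed initial state.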









[Figure 2, panels (a) and (b). Horizontal axis: number of objects N. Curves: optimal value of the
limiting system; optimal expected value for the system with N objects; expected value when applying
the limiting policy; expected value for the heuristic. Panel (b): same as (a) with the y-axis zoomed
around 0.49.]



Figure 2: Damage caused by the worm for various infection policies as a function of the size of
the system $N$. The goal of the worm is to maximize the damage (higher means better). Panel (a)
shows the optimal value $v^*$ for the limiting system (mean field limit), the optimal value $V^N_*$ for the
system with N objects, the value of the asymptotically optimal policy given in Corollary H 
and the value of a classic heuristic. Panel (b) zooms the axis around the values of the optimal 
policies. 

Theorem 2 shows that $\alpha^*$ is asymptotically optimal ($\lim_{N\to\infty} V^N_{\alpha^*} = \lim_{N\to\infty} V^N_* = v^*$), but
Figure 2(a) shows that, already for low values of $N$, these three quantities are very close. A
classic heuristic for this maximal infection problem is to kill a node with a constant probability
regardless of the time horizon. Our numerical study shows that $\alpha^*$ outperforms this heuristic by
more than 20%. The performance of this heuristic does not increase with the size of the system $N$.

In order to illustrate the convergence of the values $V^N_{\alpha^*}$ and $V^N_*$ to $v^*$, Figure 2(b) is a detailed
view of Figure 2(a) where we show the two quantities and their common limit $v^*$. This
figure shows that the convergence is indeed very fast. Other numerical experiments indicate
that this is true for a large panel of parameters. Although this figure seems to indicate that
the three values always appear in the same order, this is not true in general: for example, adding $5D(t)$
to the damage function changes the ordering ($V^N_{\alpha^*}$ is always less than $V^N_*$ by definition of $V^N_*$).



4.3.3 Brokering Problem 

Finally, let us consider a model of a volunteer computing system such as BOINC
(http://boinc.berkeley.edu/). Volunteer computing means that people make their personal computer available for
a computing system. When they do not use their computer, it is available for the computing system.
However, as soon as they start using their computer, it becomes unavailable for the computing
system. These systems are becoming more and more popular and provide large computing power
at a very low cost.

The Markovian model with N objects is defined as follows. The N objects represent the users 
that can submit jobs to the system and the resources that can run the jobs. The resources are 
grouped into a small number of clusters and all resources in the same cluster share the same 
characteristics in terms of speed and availability. Users send jobs to a central broker whose role is 
to balance the load among the clusters. 
Figure 3: The brokering problem in a desktop grid system, such as BOINC.

The model is a discrete time model of a queuing system. Actually, a more natural continuous- 
time Markov model could also be handled similarly, by using uniformization. 

Each user has a state $x \in \{\text{on}, \text{off}\}$. At each time step, an active user
sends one job with a given probability and becomes inactive with probability $p_i/N$. An inactive user
sends no jobs to the system and becomes active with probability $p_o/N$.

There are $C$ clusters in the system. Each cluster $c$ contains computing resources. Each
resource has a buffer of bounded size $J_c$. A resource can either be valid or broken. If it is valid
and if it has one or more jobs in its queue, it completes one job with probability $\mu_c/N$ at this time
slot. A resource gets broken with probability $p_b/N$. In that case, it discards all the packets of its
buffer. A broken resource becomes valid with probability $p_v/N$.

At each time step, the broker takes an action $a \in \mathcal{P}(\{1, \dots, C\})$ and sends the packets it received
to the clusters according to the distribution $a$. A packet sent to cluster $c$ joins the queue of one
of the resources, say $k$, according to a local rule (for example chosen uniformly among the
resources composing the cluster). If the queue of resource $k$ is full, the packet is lost. The goal of
the broker is to minimize the number of losses plus the total size of the queues over a finite horizon
(and hence the response time of accepted packets).

This model is represented in Figure 3.

The system has an intensity $I(N) = 1/N$. The number $C$ of clusters is fixed and does not
depend on $N$, as well as the sizes $J_c$ of the buffers. However, both the number of users and
the number of resources in the clusters are linear in $N$. Finally, by construction, all the state
changes occur with probabilities that scale with $1/N$.

The limiting system is described by the variable $m_o(t)$, that represents the fraction of users
who are on, and the variables $q_{c,j}(t)$ and $b_c(t)$ that, respectively, represent the fraction of resources
in cluster $c$ having $j$ jobs in their buffer and the fraction of resources in cluster $c$ that are broken.
For an action function $\alpha(\cdot)$, we denote by $\alpha_c(\cdot)$ the fraction of packets sent to cluster $c$. Finally, let
us denote by $m$ the fraction of users (both active or inactive) and by $q_c$ the fraction of processors in
cluster $c$. These fractions are constant (independent of time) and satisfy $m + q_1 + \dots + q_C = 1$.



INRIA 



Mean field for Markov Decision Processes 






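We get the following equations

$$\begin{aligned}
\frac{dm_o(t)}{dt} &= -p_i m_o(t) + p_o\big(m - m_o(t)\big) && (22)\\
\frac{dq_{c,0}(t)}{dt} &= p_v b_c(t) - \frac{\alpha_c(t)\, p_s\, m_o(t)}{q_c}\, q_{c,0}(t) + \mu_c q_{c,1}(t) - p_b q_{c,0}(t) && (23)\\
\frac{dq_{c,j}(t)}{dt} &= \frac{\alpha_c(t)\, p_s\, m_o(t)}{q_c} \big(q_{c,j-1}(t) - q_{c,j}(t)\big) + \mu_c \big(q_{c,j+1}(t) - q_{c,j}(t)\big) - p_b q_{c,j}(t) && (24)\\
\frac{dq_{c,J_c}(t)}{dt} &= \frac{\alpha_c(t)\, p_s\, m_o(t)}{q_c}\, q_{c,J_c-1}(t) - \mu_c q_{c,J_c}(t) - p_b q_{c,J_c}(t) && (25)\\
\frac{db_c(t)}{dt} &= -p_v b_c(t) + p_b \sum_{j=0}^{J_c} q_{c,j}(t) && (26)
\end{aligned}$$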



where (23) and (25) hold for each cluster $c$ and (24) holds for each cluster $c$ and for all $0 < j < J_c$. The
cost associated to the action function $\alpha$ is:

$$\int_0^T \left[ \sum_{c=1}^{C} \sum_{j=1}^{J_c} j\, q_{c,j}(t) + \gamma \left( \sum_{c=1}^{C} \frac{\alpha_c(t)\, p_s\, m_o(t)}{q_c} \big(q_{c,J_c}(t) + b_c(t)\big) + \sum_{c=1}^{C} p_b \sum_{j=1}^{J_c} j\, q_{c,j}(t) \right) \right] dt \qquad (27)$$

The first part of (27) represents the cost induced by the number of jobs in the system. The second
part of (27) represents the cost induced by the losses. The parameter $\gamma$ gives weight on the cost
induced by the losses.

The HJB problem becomes minimizing (27) subject to the variables $m_o$, $q_{c,j}$, $b_c$ satisfying
Equations (22) to (26). This system is made of $(J + 2)C$ ODEs. Solving the HJB equation
numerically in this case can be challenging but remains more tractable than solving the original
Bellman equation over states. The curse of dimensionality is so acute for the discrete system
that it cannot be solved numerically with more than 10 processors [5].
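
To make the structure of the limiting brokering system concrete, here is a Python sketch of its right-hand side, following one reading of Equations (22) to (26). The function name, the argument names and the data layout (`q[c][j]` for the fraction of resources of cluster `c` holding `j` jobs, `size[c]` for the total fraction of resources in cluster `c`) are choices made for this sketch. In particular, the per-resource arrival rate `alpha[c] * p_s * m_o / size[c]` is an assumption about how packets are spread over the resources of a cluster.

```python
def brokering_rhs(m_o, q, b, alpha, m, size, p_s, p_i, p_o, p_b, p_v, mu):
    """Time derivatives (dm_o, dq, db) of the limiting brokering system.

    Each cluster is assumed to have buffers of size at least one.
    """
    dm_o = -p_i * m_o + p_o * (m - m_o)
    dq, db = [], []
    for c in range(len(q)):
        J = len(q[c]) - 1
        lam = alpha[c] * p_s * m_o / size[c]  # assumed per-resource arrival rate
        row = []
        for j in range(J + 1):
            # Arrivals move a resource from j-1 to j jobs; repairs feed level 0.
            gain_arrival = lam * q[c][j - 1] if j > 0 else p_v * b[c]
            loss_arrival = lam * q[c][j] if j < J else 0.0
            # Service completions move a resource from j+1 to j jobs.
            gain_service = mu[c] * q[c][j + 1] if j < J else 0.0
            loss_service = mu[c] * q[c][j] if j > 0 else 0.0
            row.append(gain_arrival - loss_arrival + gain_service
                       - loss_service - p_b * q[c][j])
        dq.append(row)
        db.append(-p_v * b[c] + p_b * sum(q[c]))
    return dm_o, dq, db


# Hypothetical numbers: one cluster holding 40% of the objects, buffers of size 2.
dm_o, dq, db = brokering_rhs(
    m_o=0.2, q=[[0.2, 0.1, 0.05]], b=[0.05], alpha=[1.0], m=0.6, size=[0.4],
    p_s=0.5, p_i=0.2, p_o=0.3, p_b=0.1, p_v=0.4, mu=[0.7])
```

A useful check on such an implementation is that, within each cluster, the derivatives of all queue-length fractions and of the broken fraction sum to zero, since resources only move between these states.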

5 Proofs 

5.1 Details of Scaling Constants 



J{N,T) 
BiN,d) 



dcf 



dct 



loiN) + /(iV)ife(^-^i)^ (^^+2(1 + min(l//(7V),p)) 
8t|l2 [i^{N)I{N)^ + h{N)AT + I{N))] + [2l2{N) + I{N) (/o(Af) + L2)'] } 



/(iV)| 
3 

2l 



Kr{S + Io{N)T) 



Kr 

i7 



dcf 



B'{N,6) = 



I{N) \\r 



Kr[S + I[^{N, a)T 



\\r\\Lj{N,T)'^ 
e^i^ - 1 



3 

2l 



kill Wr)^ 



5.2 Proof of Theorem 1

We begin with a few general statements. Let $\mathcal{P}$ be the set of probabilities on $\mathcal{S}$ and $\mu^N : \mathcal{S}^N \to \mathcal{P}$
defined by $\mu^N(x)_i = \frac{1}{N}\sum_{n=1}^{N} \mathbf{1}_{\{x_n = i\}}$ for $i \in \mathcal{S}$. Also let $\mathcal{P}^N$ be the image set of $\mu^N$, i.e. the
set of all occupancy measures that are possible when the number of objects is $N$. The following
establishes that if two global states have the same occupancy measure, then they differ by a
permutation.

Lemma 1. For all $x, x' \in \mathcal{S}^N$, if $\mu^N(x) = \mu^N(x')$ there exists some $\sigma \in \mathfrak{S}^N$ such that $x' = \sigma(x)$.

Proof. By induction on $N$. It is obvious for $N = 1$. Assume the lemma holds for $N - 1$ and let
$x, x' \in \mathcal{S}^N$ with $\mu^N(x) = \mu^N(x')$. There is at least one coordinate, say $i$, such that $x'_i = x_1$,
because there is the same number of occurrences of $s = x_1$ in both $x$ and $x'$. Let $y = x_2 \dots x_N$ and
$y' = x'_1 \dots x'_{i-1} x'_{i+1} \dots x'_N$. Then $\mu^{N-1}(y) = \mu^{N-1}(y')$, therefore there exists some $\tau \in \mathfrak{S}^{N-1}$ such
that $y' = \tau(y)$. Define $\sigma$ by $\sigma(1) = i$, $\sigma(j) = \tau(j) + \mathbf{1}_{\{\tau(j) \ge i\}}$ for $j \ge 2$, so that $x' = \sigma(x)$. Clearly $\sigma$
is a permutation of $\{1, \dots, N\}$. □

Let $f : \mathcal{S}^N \to E$ where $E$ is some arbitrary set. We say that $f$ is invariant under $\mathfrak{S}^N$ if
$f \circ \sigma = f$ for all $\sigma \in \mathfrak{S}^N$. The following result states that if a function of the global state is
invariant under permutations, it is a function of the occupancy measure.

Lemma 2. If $f : \mathcal{S}^N \to E$ is invariant under $\mathfrak{S}^N$ then there exists $\bar{f} : \mathcal{P}^N \to E$ such that
$f = \bar{f} \circ \mu^N$.

Proof. Define $\bar{f}$ as follows. For every $m \in \mathcal{P}^N$ pick some arbitrary $x_0 \in (\mu^N)^{-1}(m)$ and let
$\bar{f}(m) = f(x_0)$. Now let $x$, perhaps different from $x_0$, be such that $\mu^N(x) = m$. By Lemma 1 there
exists some $\sigma \in \mathfrak{S}^N$ such that $x = \sigma(x_0)$, therefore $f(x) = f(x_0) = \bar{f}(\mu^N(x))$. This is true for every
$m \in \mathcal{P}^N$, thus $f(x) = \bar{f}(\mu^N(x))$ for every $x \in \mathcal{S}^N$. □
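
The two lemmas can be illustrated on small examples with the following Python sketch. It computes the occupancy measure of a global state and, when two global states have the same occupancy measure, builds an explicit matching of positions between them. The function names and the indexing convention (`x_prime[sigma[j]] == x[j]`, with positions counted from 0) are choices made for this sketch and may differ from the convention used in the proof.

```python
from collections import Counter, defaultdict


def occupancy_measure(x, states):
    """Fraction of objects in each state, in the order given by `states`."""
    counts = Counter(x)
    return tuple(counts[s] / len(x) for s in states)


def matching_permutation(x, x_prime):
    """Return sigma with x_prime[sigma[j]] == x[j] for all j, or None.

    None is returned when the two global states do not have the same
    occupancy measure, in which case no such permutation exists.
    """
    if Counter(x) != Counter(x_prime):
        return None
    free_positions = defaultdict(list)
    for position, state in enumerate(x_prime):
        free_positions[state].append(position)
    # Send each object of x to an unused position of x_prime with the same state.
    return [free_positions[state].pop() for state in x]


x = ("a", "b", "a", "c")
x_prime = ("c", "a", "a", "b")
sigma = matching_permutation(x, x_prime)
```

Any function of the global state that ignores the order of the objects, such as `sorted`, takes the same value on `x` and `x_prime`, which is the content of the second lemma in this toy setting.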

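The sequence of actions is given and $N$ is fixed. We are thus given a time-inhomogeneous
Markov chain $X^N$ on $\mathcal{S}^N$, with transition kernel $G_k$, $k \in \mathbb{N}$, given by $G_k(x,y) = \Gamma^N(x,y,a_k)$,
such that for any permutation $\sigma \in \mathfrak{S}^N$ and any states $x, y$ we have

$$G_k(\sigma(x), \sigma(y)) = G_k(x,y) \qquad (28)$$

Let $\mathcal{F}(k)$ be the $\sigma$-field generated by $X^N(s)$ for $s \le k$ and $\mathcal{G}(k)$ be the $\sigma$-field generated by
$M^N(s)$ for $s \le k$. Note that because $M^N = \mu^N \circ X^N$, $\mathcal{G}(k) \subset \mathcal{F}(k)$.

Pick some arbitrary test function $\varphi : \mathcal{P}^N \to \mathbb{R}$ and fix some time $k \ge 1$; we will now
compute $E(\varphi(M^N(k)) \mid \mathcal{F}(k-1))$. Because $M^N$ is a function of $X^N$ and $X^N$ is a Markov chain,
$E(\varphi(M^N(k)) \mid \mathcal{F}(k-1))$ is a function, say $\psi$, of $X^N(k-1)$. We have, for any fixed $\sigma \in \mathfrak{S}^N$:

$$\begin{aligned}
\psi(x) &= \sum_{y \in \mathcal{S}^N} G_k(x,y)\,\varphi(\mu^N(y)) = \sum_{y \in \mathcal{S}^N} G_k(x,\sigma(y))\,\varphi(\mu^N(\sigma(y))) = \sum_{y \in \mathcal{S}^N} G_k(x,\sigma(y))\,\varphi(\mu^N(y)) \\
\psi(\sigma(x)) &= \sum_{y \in \mathcal{S}^N} G_k(\sigma(x),\sigma(y))\,\varphi(\mu^N(y)) = \sum_{y \in \mathcal{S}^N} G_k(x,y)\,\varphi(\mu^N(y))
\end{aligned}$$

where the last equality is by Eq. (28). Thus $\psi(\sigma(x)) = \psi(x)$ and by Lemma 2 there exists some
function $\bar{\psi}$ such that $\psi(x) = \bar{\psi}(\mu^N(x))$, i.e.

$$E(\varphi(M^N(k)) \mid \mathcal{F}(k-1)) = \bar{\psi}(M^N(k-1)) \qquad (29)$$

In particular, $E(\varphi(M^N(k)) \mid \mathcal{F}(k-1))$ is $\mathcal{G}(k-1)$-measurable. Now

$$\begin{aligned}
E(\varphi(M^N(k)) \mid \mathcal{G}(k-1)) &= E\big(E(\varphi(M^N(k)) \mid \mathcal{F}(k-1)) \mid \mathcal{G}(k-1)\big) \\
&= E\big(\bar{\psi}(M^N(k-1)) \mid \mathcal{G}(k-1)\big) = \bar{\psi}(M^N(k-1))
\end{aligned}$$

which expresses that $M^N$ is a Markov chain.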
5.3 Proof of Theorem 5

The proof is inspired by the method in [5]. The main idea of the proof is to write

$$\big\|M^N_\pi(k) - \phi_{kI(N)}(m_0, A^N_\pi)\big\| \le \Big\|M^N_\pi(k) - M^N_\pi(0) - \sum_{j=0}^{k-1} f^N(j)\Big\| + \Big\|M^N_\pi(0) + \sum_{j=0}^{k-1} f^N(j) - \phi_{kI(N)}(m_0, A^N_\pi)\Big\|$$

where $f^N(k) := F^N[M^N_\pi(k), \pi_k(M^N_\pi(k))]$ is the drift at time $k$ if the empirical measure is $M^N_\pi(k)$.
The first part is bounded with high probability using a martingale argument (Lemma 4) and the
second part is bounded using an integral formula.

Recah that M^{t) =^ ( 7(lv)J)' i-^- {kI{N)) = M^{k) for fc e N and is 

piecewise constant and right-continuous. Let (k) be the number of objects that change state 
between time slots k and fc + 1. Thus, 



and thus 



as well, with k — 



M^{k + 1) - M^{k)\\ < iV-V2Af (fc) 



I{N)\- 



Define 



(30) 
(31) 

(32) 

and let (t) be the continuous, piecewise linear interpolation such that {kI(N)) = Z^ (fc) 

for fc e N. Recall that (t) T:yt/i(N)\{M^ {[t/ I{N)\)) - {t) is the action taken by the 
controller at time t/I{N). It follows from these definitions that: 



fc-i 



Zf(fc) = Mf(fc)-M^(0)-5:^^^(Mf(j),.,(M.^(,))) 



Mi^(i) = A/i^(0) 



I{N) 
1 

W)' 



{M:^{s)^A^{s))ds + Z^{t) 



F^{M^{s),A:{s)]ds + Z-{t) 



,N 



{M^{s\A^{s))~F'' [M^{s),A^{s) 



ds 



/o 

Using the definition of the semi-flow (f>t{mQ, A^) = toq -I- f{4>s{'mQ, A^), A^ {s))ds, we get: 



Mi^t)-Mmo,A':) = Mi'{0)-mo + Zi^t) 



1 

iW) 
tv 1 



F 



N 



(Mf (.), A^{s)) - F^ {Mmo, A^), (.)) 



ds 







liN) 



F"" (0,(mo, ), (s)) - / {Umo, A^),A^ (s)) 



ds 



1 



HN) 



F 



N 



(M,^(s),Af (.)) -F^ [M^islA^is)) 



ds 
Applying Assumption (A2) to the third line, (A3) to the second and fourth lines, and Equation (31)
to the fourth line leads to:



Li 



ds 



A:=0 



For all TV, vr, T, 61 > and 62 > 0, define 

k 



UJ 



en-, sup ^A^{j)>bA , n2 = Ken-, sup (fc)|| > 62 



(33) 



0<fc<7(iv) j=0 



0<k<j^ 



Assumption (A1) implies conditions on the first and second order moment of $\Delta^N_\pi(k)$. Therefore by
Lemma 3 this shows that for any $b_1 > 0$:



p(r2i) < 

Moreover, we show in Lemma [4] that: 



^/.W + ^(T + /(^)) 



.T 



liNy 



V{n2) < 25^-2 2l2{N) + I{N)[iIo{N) + L2)Y 



(34) 



(35) 



Now fix some e > and let bi = ^ ^rr,.r. , h = e/2. For uj en \ (Qi U ^,2) and for < t < T: 
M^{t)~Mmo,A^) < llMf (0)-mo|| +e + /o(7V)T 



ds 



By Gronwall's lemma:

$$\big\|\hat{M}^N_\pi(t) - \phi_t(m_0, A^N_\pi)\big\| \le \big[\|M^N(0) - m_0\| + \epsilon + I_0(N)T\big]\, e^{L_1 t} \qquad (36)$$

and this is true for all $\omega \in \Omega \setminus (\Omega_1 \cup \Omega_2)$. We apply the union bound $P(\Omega_1 \cup \Omega_2) \le P(\Omega_1) + P(\Omega_2)$
which, with Eq. (34) and Eq. (35), concludes the proof.

The proof of Theorem 5 uses the following lemmas.

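Lemma 3. Let $(W_k)_{k \in \mathbb{N}}$ be a sequence of square integrable, non-negative random variables, adapted
to a filtration $(\mathcal{F}_k)_{k \in \mathbb{N}}$, such that $W_0 = 0$ a.s. and for all $k \in \mathbb{N}$: $E(W_{k+1} \mid \mathcal{F}_k) \le \alpha$ and
$E(W_{k+1}^2 \mid \mathcal{F}_k) \le \beta$. Then for all $n \in \mathbb{N}$ and $b > 0$:

$$P\Big(\sup_{0 \le k \le n} (W_0 + \dots + W_k) > b\Big) \le \frac{n\beta + n(n+1)\alpha^2}{b^2} \qquad (37)$$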



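Proof. Let $Y_n = \sum_{k=0}^{n} W_k$. It follows that $E(Y_n) \le \alpha n$ and

$$E(Y_{n+1}^2) \le \beta + 2n\alpha^2 + E(Y_n^2)$$

from where we derive that

$$E(Y_n^2) \le n\beta + n(n+1)\alpha^2. \qquad (38)$$

Now, because $W_{n+1} \ge 0$:

$$E(Y_{n+1}^2 \mid \mathcal{F}_n) \ge \big(E(Y_{n+1} \mid \mathcal{F}_n)\big)^2 = \big(Y_n + E(W_{n+1} \mid \mathcal{F}_n)\big)^2 \ge Y_n^2$$

thus $Y_n^2$ is a non-negative sub-martingale and by Kolmogorov's inequality:

$$P\Big(\sup_{0 \le k \le n} Y_k > b\Big) = P\Big(\sup_{0 \le k \le n} Y_k^2 > b^2\Big) \le \frac{E(Y_n^2)}{b^2}$$

Together with Eq. (38) this concludes the proof. □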



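Lemma 4. Define $Z^N_\pi$ as in Eq. (32). For all $N \ge 2$, $b > 0$, $T > 0$ and all policy $\pi$:

$$P\Big(\sup_{0 \le k \le \lfloor T/I(N) \rfloor} \|Z^N_\pi(k)\| > b\Big) \le \frac{2S^2 T}{b^2}\Big[2 I_2(N) + I(N)\big(I_0(N) + L_2\big)^2\Big]$$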



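Proof. The proof is inspired by the methods in [1]. For fixed $N$ and $h \in \mathbb{R}^S$, let

$$L_k = \langle h, Z^N_\pi(k) \rangle$$

By the definition of $Z^N_\pi$, $L_k$ is a martingale w.r.t. the filtration $(\mathcal{F}_k)_{k \in \mathbb{N}}$ generated by $M^N_\pi$. Thus

$$E\big((L_{k+1} - L_k)^2 \mid \mathcal{F}_k\big) \le E\big(\langle h, M^N_\pi(k+1) - M^N_\pi(k) \rangle^2 \mid \mathcal{F}_k\big) + \big\langle h, F^N\big(M^N_\pi(k), \pi_k(M^N_\pi(k))\big) \big\rangle^2$$

By Assumption (A2):

$$\big|\big\langle h, F^N\big(M^N_\pi(k), \pi_k(M^N_\pi(k))\big) \big\rangle\big| \le (I_0(N) + L_2)\, I(N)\, \|h\|$$

Thus, using Eq. (30) and Assumption (A1):

$$\begin{aligned}
E\big((L_{k+1} - L_k)^2 \mid \mathcal{F}_k\big) &\le \|h\|^2 \Big[N^{-2}\, 2\, E\big(\Delta^N_\pi(k)^2 \mid \mathcal{F}_k\big) + \big((I_0(N) + L_2) I(N)\big)^2\Big] \\
&\le \|h\|^2 \Big[2 I(N) I_2(N) + \big((I_0(N) + L_2) I(N)\big)^2\Big]
\end{aligned}$$

We now apply Kolmogorov's inequality for martingales and obtain

$$P\Big(\sup_{0 \le k \le n} L_k > b\Big) \le \frac{n}{b^2}\, \|h\|^2 \Big[2 I(N) I_2(N) + \big((I_0(N) + L_2) I(N)\big)^2\Big]$$

Let $\Xi_h$ be the set of $\omega \in \Omega$ such that $\sup_{0 \le k \le n} \langle h, Z^N_\pi(k) \rangle \le b$ and let $\Xi := \bigcap_{h = \pm e_i,\, i = 1 \dots S} \Xi_h$
where $e_i$ is the $i$th vector of the canonical basis of $\mathbb{R}^S$. It follows that, for all $\omega \in \Xi$ and $0 \le k \le n$
and $i = 1 \dots S$: $|\langle Z^N_\pi(k), e_i \rangle| \le b$. This means that for all $\omega \in \Xi$: $\|Z^N_\pi(k)\| \le \sqrt{S}\, b$. By the union
bound applied to the complement of $\Xi$, we have

$$1 - P(\Xi) \le \frac{2Sn}{b^2} \Big[2 I(N) I_2(N) + \big((I_0(N) + L_2) I(N)\big)^2\Big]$$

Thus we have shown that, for all $b > 0$:

$$P\Big(\sup_{0 \le k \le n} \|Z^N_\pi(k)\| > \sqrt{S}\, b\Big) \le \frac{2S\, n I(N)}{b^2} \Big[2 I_2(N) + I(N)\big(I_0(N) + L_2\big)^2\Big]$$

which, by changing $b$ to $b/\sqrt{S}$ and taking $n = \lfloor T/I(N) \rfloor$ (so that $nI(N) \le T$), shows the result. □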
5.4 Proof of Theorem 6

We use the same notation as in the proof of Theorem 5. By definition of $V^N_\pi$, $v$ and the time
horizons:



(Af^(0))-E(t;^«(mo)) 



E 



-E 



T 



r(m^«(s),Af(.))d5l 



The latter term is bounded by $I(N)\|r\|_\infty$. Let $\epsilon > 0$ and $\Omega_0 = \Omega_1 \cup \Omega_2$ where $\Omega_1$, $\Omega_2$ are as in
the proof of Theorem 5. Thus $P(\Omega_0) \le J(N,T)/\epsilon^2$ and, using the Lipschitz continuity of $r$ in $m$ (with
constant $K_r$):


|yj^(M^(0)) - E [vA^imo)] I < I{N) ||r||. 



2\\r\\^JiN,T) 



^i^0-io \\M^ (s) - rriAN {s)\\ ds 



ds < and, by Eq.(|36l 



For Lu^rio and s € [0, T]: M^{s) - M^{s) 

/(f (s) -m^iv(s) ds < (II M^(0)- moll +/o(iV)r + e) s^:^ thus 

|yj^(M^(0)) - E K«(mo)] I < B,{N, ||M^(0) - moll) 



(39) 



where 



dcf 



B,{N, 5) = I{N) ||r|L + Kr [5 + h{N)T + e) 
This holds for every e > 0, thus 

rN (i^fN 



1^-1 , KrI{N) , 2||r||^ J(iV,r) 



2Li 



yj^(M^(0)) -E [«^«(mo)] I < B{N, ||M^(0) - mo||) 



(40) 



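where $B(N, \delta) := \inf_{\epsilon > 0} B_\epsilon(N, \delta)$. By direct calculus, one finds that $\inf_{\epsilon > 0}\big(a\epsilon + b/\epsilon^2\big) = \frac{3}{2^{2/3}}\, a^{2/3} b^{1/3}$
for $a > 0$, $b > 0$, which gives the required formula for $B(N, \delta)$.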
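
For completeness, the value of this infimum can be checked directly. The short derivation below only uses the function $g(\epsilon) = a\epsilon + b/\epsilon^2$ with $a > 0$ and $b > 0$; the name $g$ and the notation $\epsilon^{*}$ for the minimizer are introduced here for convenience.

```latex
\begin{align*}
g(\epsilon) &= a\epsilon + \frac{b}{\epsilon^{2}}, \qquad
g'(\epsilon) = a - \frac{2b}{\epsilon^{3}} = 0
\iff \epsilon^{*} = \Big(\frac{2b}{a}\Big)^{1/3}, \\
g(\epsilon^{*}) &= a^{2/3}(2b)^{1/3} + b\,\Big(\frac{a}{2b}\Big)^{2/3}
= a^{2/3} b^{1/3}\big(2^{1/3} + 2^{-2/3}\big)
= \frac{3}{2^{2/3}}\, a^{2/3} b^{1/3}.
\end{align*}
```

Since $g''(\epsilon) = 6b/\epsilon^{4} > 0$ on $(0, \infty)$, the function is convex there and the critical point $\epsilon^{*}$ is indeed its minimizer.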

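5.5 Proof of Theorem 3

Let $\alpha^N$ be the right-continuous function, constant on the intervals $[kI(N); (k+1)I(N))$, such that
$\alpha^N(s) = \alpha(s)$ at the points $s = kI(N)$. $\alpha^N$ can be viewed as a policy independent of $m$. Therefore, by
Theorem 5, on the set $\Omega \setminus (\Omega_1 \cup \Omega_2)$, for every $t \in [0; T]$:

$$\big\|\hat{M}^N(t) - \phi_t(m_0, \alpha)\big\| \le \big[\|M^N(0) - m_0\| + I_0(N)T + \epsilon\big]\, e^{L_1 T} + u(t)$$

with $u(t) := \|\phi_t(m_0, \alpha^N) - \phi_t(m_0, \alpha)\|$. We have

$$\begin{aligned}
u(t) &\le \int_0^t \big\| f(\phi_s(m_0, \alpha), \alpha(s)) - f(\phi_s(m_0, \alpha^N), \alpha^N(s)) \big\|\, ds \\
&\le K \int_0^t \Big( \|\phi_s(m_0, \alpha) - \phi_s(m_0, \alpha^N)\| + d(\alpha(s), \alpha^N(s)) \Big)\, ds \\
&\le K \int_0^t u(s)\, ds + K d_1
\end{aligned}$$

where $d_1 := \int_0^T \|\alpha(t) - \alpha^N(t)\|\, dt$. Therefore, using Gronwall's inequality, we have $u(t) \le K d_1 e^{KT}$.
By Lemma 5 this shows Eq. (13). The rest of the proof is as for Theorem 6.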
Lemma 5. If $\alpha$ is a piecewise Lipschitz continuous action function on $[0; T]$, of constant $K_\alpha$, and
with at most $p$ discontinuity points, then

^ d{a{t),a^{t))dt < TI{N) +2(1 + min(l//(7V),p)) ||a| 

Proof. Let us first assume that $T = kI(N)$. The left-hand side $d_1 = \int_0^T d(\alpha(t), \alpha^N(t))\, dt$
can be decomposed on all intervals $[iI(N), (i+1)I(N))$:

lT/I{N)i .(^+l)^N) lT/I{N)i .(»+i)/(Ar) 

di = y2 \\a{s)-a^{s)\\ds< V / \\a{s) - a{iI{N))\\ ds 

,^0 J^I(N) J^I{N) 

If a has no discontinuity point on [iI{N), {i + 1)/(A^)), then 

r(i+l)/(Af) pHN) 
lil(N) 

If a has one or more discontinuity points on [iI{N), {i + 1)I{N)), then 



/ d{a{s),a{iI{N)))ds < K^sds < Kc,2I{Nf 

Jil(N) Jo 



{i+l)I(N) i.{i+l)I(N) 

d{ais)a{iIiN)))ds < 2\\a\\^ds<2 \\a\\^ /(TV) 

iI(N) JiI{N) 

There are at most min(l//(A^),p) intervals [iI{N), {i + l)/(iV)] that have discontinuity points 
which shows that 

di < TI{N)[^ + min(l//(iV),p)2 
If $T \ne kI(N)$, then $T = kI(N) + t$ with $0 < t < I(N)$. Therefore, there is an additional term
of $\int_{kI(N)}^{T} d(\alpha(s), \alpha^N(s))\, ds \le 2\|\alpha\|_\infty I(N)$. □

5.6 Proof of Theorem 2

This theorem is a direct consequence of Theorem 3 and Theorem 6. We do the proof for almost
sure convergence; the proof for convergence in probability is similar. To prove the theorem we
prove

$$\limsup_{N \to \infty} V^N_*(M^N(0)) \le v^*(m_0) \le \liminf_{N \to \infty} V^N_*(M^N(0)) \qquad (41)$$

• Let $\epsilon > 0$ and $\alpha(\cdot)$ be an action function such that $v_\alpha(m_0) \ge v^*(m_0) - \epsilon$ (such an action is
called $\epsilon$-optimal). Theorem 3 shows that $\lim_{N \to \infty} V^N_\alpha(M^N(0)) = v_\alpha(m_0) \ge v^*(m_0) - \epsilon$ a.s.
This shows that $\liminf_{N \to \infty} V^N_*(M^N(0)) \ge \lim_{N \to \infty} V^N_\alpha(M^N(0)) \ge v^*(m_0) - \epsilon$; this holds
for every $\epsilon > 0$, thus $\liminf_{N \to \infty} V^N_*(M^N(0)) \ge v^*(m_0)$ a.s., which establishes the second
inequality in Eq. (41), on a set of probability 1.

• Let $B(N, \delta)$ be as in Theorem 6, $\epsilon > 0$ and $\pi^N$ such that $V^N_*(M^N(0)) \le V^N_{\pi^N}(M^N(0)) + \epsilon$.
By Theorem 6, $V^N_{\pi^N}(M^N(0)) \le E\big(v_{A^N_{\pi^N}}(m_0)\big) + B(N, \delta^N) \le v^*(m_0) + B(N, \delta^N)$ where
$\delta^N := \|M^N(0) - m_0\|$. Thus $V^N_*(M^N(0)) \le v^*(m_0) + B(N, \delta^N) + \epsilon$. If further $\delta^N \to 0$
a.s., it follows that $\limsup_{N \to \infty} V^N_*(M^N(0)) \le v^*(m_0) + \epsilon$ a.s. for every $\epsilon > 0$, thus
$\limsup_{N \to \infty} V^N_*(M^N(0)) \le v^*(m_0)$ a.s.



6 Conclusion and Perspectives 

There are several natural questions arising from this work. One concerns the convergence of optimal
policies. Optimal policies $\pi^N_*$ of a stochastic system with $N$ objects may not be unique; they may
also exhibit thresholds and therefore be discontinuous. This implies that optimal policies will not
converge in general. In some particular cases, such as the best response dynamics studied in [10],
limit theorems can nevertheless be obtained, at the cost of a much greater complexity. In full
generality however, this problem is still open and definitely deserves further investigations.

The second question concerns the time horizon. In this paper we have focused on the finite
horizon case. Actually, most results, and in particular Theorems 2 and 3, remain valid with an
infinite horizon with discount. The main argument that makes everything work in the discounted
case is the following. When the rewards $r(s, a)$ are bounded, for a given discount $\beta < 1$ and a
given $\epsilon > 0$, it is possible to find a finite time horizon $T$ such that the expected discounted value of
a policy $\pi$ can be decomposed into the value over time $T$ plus a term less than $\epsilon$:

$$E \sum_{t \ge 0} \beta^t\, r\big(M^N(t), \pi(M^N(t))\big) \le E \sum_{t=0}^{T} \beta^t\, r\big(M^N(t), \pi(M^N(t))\big) + \epsilon.$$

Therefore, the main result of this paper, which states that a policy $\pi$ that is optimal in the mean
field limit is near-optimal for the finite system with $N$ objects, also holds in the infinite horizon
discounted case.
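
The truncation argument for the discounted case can be made explicit with a small Python sketch. If the rewards are bounded in absolute value by some constant `r_max` (a name introduced here), the discounted rewards collected after time `T` sum to at most `r_max * beta**(T + 1) / (1 - beta)`, so it is enough to take the smallest `T` for which this geometric tail is below the target accuracy. The function name and the numerical values are illustrative only.

```python
def truncation_horizon(beta: float, r_max: float, eps: float) -> int:
    """Smallest T such that r_max * beta**(T + 1) / (1 - beta) <= eps."""
    if not 0.0 < beta < 1.0:
        raise ValueError("the discount factor must lie strictly between 0 and 1")
    if r_max < 0.0 or eps <= 0.0:
        raise ValueError("r_max must be non-negative and eps positive")
    horizon = 0
    tail = r_max * beta / (1.0 - beta)  # bound on the rewards collected after time 0
    while tail > eps:
        horizon += 1
        tail *= beta  # bound on the rewards collected after time `horizon`
    return horizon


T_example = truncation_horizon(beta=0.9, r_max=1.0, eps=0.01)
```

The horizon grows only logarithmically in `1 / eps`, which is why the finite-horizon results carry over to the discounted case at a moderate cost.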

As for the infinite horizon without discount or average reward cases, convergence of the value
when $N$ goes to infinity is not guaranteed in general. Finding natural assumptions under which
convergence holds is also one of our goals for the future.
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